My motivation for this project is to explore birth rates across the US states from 2000-2014. Birth rates are a defining factor in population and general welfare across the US. It’s interesting to see the fluctuation in rates across years, as well as across states. We can infer aspects such as the economy, the existing populations in states, as well as the comparison between child-bearing age citizens vs. non-childbearing ages in certain states.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(dplyr)
library(broom)
library(sf)
## Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
library(RColorBrewer)
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
us_births <- read_csv("~/Desktop/Summer A 2023/Data Viz/Projects/Ely_dataviz_mini-project_02/data/us_births_00_14.csv")
## Rows: 5479 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): day_of_week
## dbl (4): year, month, date_of_month, births
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
spatial_births <- read_csv("~/Desktop/Summer A 2023/Data Viz/Projects/Ely_dataviz_mini-project_02/data/birth-rate-by-state.csv")
## Rows: 50 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): state
## dbl (3): birthsPer1K, cdc2021Births, cbWomenWhoGaveBirth2021
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(us_births)
## # A tibble: 6 × 6
## year month date_of_month date day_of_week births
## <dbl> <dbl> <dbl> <date> <chr> <dbl>
## 1 2000 1 1 2000-01-01 Sat 9083
## 2 2000 1 2 2000-01-02 Sun 8006
## 3 2000 1 3 2000-01-03 Mon 11363
## 4 2000 1 4 2000-01-04 Tues 13032
## 5 2000 1 5 2000-01-05 Wed 12558
## 6 2000 1 6 2000-01-06 Thurs 12466
births_by_year <- us_births %>%
group_by(year) %>%
summarize(total_births=sum(births))
births_by_year %>%
ggplot(aes(x=year, y=total_births)) +
geom_line() +
labs(title="Distribution of Births in the US",
subtitle="2000 - 2014",
x="Year",
y="Number of Births") +
theme_light() +
theme(plot.title=element_text(face="bold", size=12)) +
scale_x_continuous(breaks=seq(2000, 2014, by=2))
- I wanted to get a better understanding of the data by doing a few
simpler visualizations. I started by plotting the number of birthdays
per year on a line. This allowed me to see a spike in births in 2007,
and a steep decline in 2009-2013. There are a few possibilities for
this, such as the 2008 recession.
births_month <- us_births %>%
group_by(month) %>%
summarize(count=sum(births))
births_month %>%
ggplot(aes(x=month, y = count)) +
geom_bar(stat="identity") +
labs(x="Month",
y="Number of Births",
title="Births per Month",
subtitle="2000-2014") +
coord_flip() +
scale_x_continuous(breaks=seq(1, 12, by=1)) +
theme_light() +
theme(plot.title=element_text(face="bold", size=12)) +
scale_fill_viridis_c(name="Area", labels = scales::comma) +
scale_y_continuous(labels = comma)
Another I wanted to look at were the months with the most births, which didn’t yield very much. The month with the most births is August, whereas the least is February. This is no surprise, months with more days, unlike February, have more births.
Originally, the number of births were in scientific notation, which bothered me as it becomes more difficult to understand. With the spatial visualization homework, I learned how to use scales and applied it here.
births_month_year <- us_births %>%
group_by(year, month) %>%
summarize(count=sum(births))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
viz1 <- births_month_year %>%
ggplot(aes(x=month, y=count, color=as.factor(year))) +
geom_line() +
labs(x="Month",
y="Number of Births",
title="Birth Counts by Month and Year",
color="Year") +
theme_light() +
scale_x_continuous(breaks=seq(1, 12, by=1)) +
theme(plot.title=element_text(face="bold", size=12))
interactive_plot <- ggplotly(viz1)
interactive_plot
htmlwidgets::saveWidget(interactive_plot, "birth_month_year.html")
birth_model <- lm(births ~ year + date_of_month, data=us_births)
tidy(birth_model)
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 76707. 14571. 5.26 0.000000146
## 2 year -32.6 7.26 -4.48 0.00000748
## 3 date_of_month -1.40 3.56 -0.392 0.695
birth_coefs <- tidy(birth_model, conf.int=TRUE) %>%
filter(term!="(Intercept)")
birth_coefs
## # A tibble: 2 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 year -32.6 7.26 -4.48 0.00000748 -46.8 -18.3
## 2 date_of_month -1.40 3.56 -0.392 0.695 -8.39 5.59
birth_coefs %>%
ggplot(aes(x=estimate, y=fct_rev(term))) +
geom_pointrange(aes(xmin=conf.low, xmax=conf.high)) +
geom_vline(xintercept=0, color="red") +
theme_light()
- In this plot, it’s show that year has a far negative impact on birth
numbers, and date of month teeters between positive and negative at 0,
which means it has very little impact.
This was when I realized the US Births data would be difficult to use for the next plot, as it lacked geographical data.
I found data for state birth rates in 2021 from World Population Review to continue with this visualization. I also downloaded a shape file from census.gov.
library(sf)
states <- read_sf("~/Desktop/Summer A 2023/Data Viz/Projects/Ely_dataviz_mini-project_02/data/cb_2018_us_state_20m/cb_2018_us_state_20m.shp")
birth_rates <- spatial_births %>%
select(state, cdc2021Births) %>%
rename(births=cdc2021Births) %>%
rename(NAME=state)
To merge the shapefile and the data, I had to select and rename the variables to match up.
state_data <- left_join(states, birth_rates, by="NAME")
map_limits <- st_crs("+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96")
state_data %>% ggplot() +
geom_sf(aes(fill=births), color="white") +
coord_sf(crs=map_limits) +
scale_fill_gradient(low="lightgreen", high="sienna2", name="Birth Rate", labels=scales::comma) +
theme_void() +
theme(legend.position="bottom",
legend.key.size=unit(1, 'cm'),
legend.key.height=unit(0.5, 'cm'),
legend.key.width=unit(1.5, 'cm'),
plot.title=element_text(face="bold", size=12)) +
labs(title="Birth Rates Across the US", subtitle="2021") +
scale_y_continuous(labels=comma)